Overview

Dataset statistics

Number of variables: 12
Number of observations: 550068
Missing cells: 556885
Missing cells (%): 8.4%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 193.8 MiB
Average record size in memory: 369.4 B
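
The percentage and per-record figures above follow from the raw counts. The sketch below re-derives them, assuming that "missing cells (%)" is taken over all cells (variables × observations) and that 1 MiB is 1024 × 1024 bytes. The variable names are illustrative and are not part of the report.

```python
# Figures copied from the dataset statistics above.
n_variables = 12
n_observations = 550068
missing_cells = 556885
total_size_mib = 193.8

# Per-variable missing counts as listed in the report's warnings
# (Product_Category_2 and Product_Category_3).
missing_by_variable = {"Product_Category_2": 173638, "Product_Category_3": 383247}

total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Assumption: 1 MiB = 1024 * 1024 bytes.
avg_record_bytes = total_size_mib * 1024 * 1024 / n_observations

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_pct:.1f}%")
print(f"average record size: {avg_record_bytes:.1f} B")
print(f"missing cells explained by the two variables: "
      f"{sum(missing_by_variable.values()) == missing_cells}")
```

Under these assumptions, all missing cells come from the two product-category variables.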

Variable types

NUM: 6
CAT: 5
BOOL: 1

Reproduction

Analysis started: 2020-07-03 09:40:06.074475
Analysis finished: 2020-07-03 10:19:38.180536
Version: pandas-profiling v2.6.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
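
The run time of the analysis is not stated directly, but it can be read off the two timestamps. A small sketch using only the standard library (the format string is an assumption that matches the timestamps as printed):

```python
from datetime import datetime

# Assumed format of the timestamps as they appear in the report.
FMT = "%Y-%m-%d %H:%M:%S.%f"

started = datetime.strptime("2020-07-03 09:40:06.074475", FMT)
finished = datetime.strptime("2020-07-03 10:19:38.180536", FMT)

duration = finished - started
print(f"analysis took {duration} (about {duration.total_seconds() / 60:.1f} minutes)")
```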

Warnings

Product_ID has a high cardinality: 3631 distinct values (High cardinality)
Product_Category_2 has 173638 (31.6%) missing values (Missing)
Product_Category_3 has 383247 (69.7%) missing values (Missing)
Occupation has 69638 (12.7%) zeros (Zeros)

Variables

User_ID
Real number (ℝ≥0)

Distinct count: 5891
Unique (%): 1.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1003028.842
Minimum: 1000001
Maximum: 1006040
Zeros: 0
Zeros (%): 0.0%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 1000001
5th percentile: 1000329
Q1: 1001516
Median: 1003077
Q3: 1004478
95th percentile: 1005747
Maximum: 1006040
Range: 6039
Interquartile range (IQR): 2962

Descriptive statistics

Standard deviation: 1727.591586
Coefficient of variation (CV): 0.001722374784
Kurtosis: -1.195500781
Mean: 1003028.842
Mean absolute deviation (MAD): 1499.97937
Skewness: 0.003065551851
Sum: 5.517340693e+11
Variance: 2984572.686
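
Several of the statistics above are simple functions of the others. The sketch below re-derives the range, IQR, coefficient of variation and variance for User_ID from the reported values. It assumes the usual definitions (CV = standard deviation / mean, variance = standard deviation squared); small differences in the last digits come from rounding in the report.

```python
# Values copied from the User_ID statistics above.
minimum, maximum = 1000001, 1006040
q1, q3 = 1001516, 1004478
mean = 1003028.842
std = 1727.591586

value_range = maximum - minimum   # reported: 6039
iqr = q3 - q1                     # reported: 2962
cv = std / mean                   # reported: 0.001722374784
variance = std ** 2               # reported: 2984572.686

print(value_range, iqr)
print(f"CV = {cv:.12f}")
print(f"variance = {variance:.3f}")
```

The same relations hold for the other numeric variables in this report.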
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1000001. 1000002.5 1000004.5 1000005.5 1000006.5 ... 1006036.5 1006037.5 1006038.5 1006039.5 1006040. ], "bayesian blocks" binning strategy used)

Common values
Value  Count  Frequency (%)
1001680  1026  0.2%
1004277  979  0.2%
1001941  898  0.2%
1001181  862  0.2%
1000889  823  0.1%
1003618  767  0.1%
1001150  752  0.1%
1001015  740  0.1%
1005795  729  0.1%
1005831  727  0.1%
Other values (5881)  541765  98.5%

Minimum 5 values
Value  Count  Frequency (%)
1000001  35  < 0.1%
1000002  77  < 0.1%
1000003  29  < 0.1%
1000004  14  < 0.1%
1000005  106  < 0.1%

Maximum 5 values
Value  Count  Frequency (%)
1006040  180  < 0.1%
1006039  74  < 0.1%
1006038  12  < 0.1%
1006037  122  < 0.1%
1006036  514  0.1%

Product_ID
Categorical

HIGH CARDINALITY
Distinct count: 3631
Unique (%): 0.7%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
P00265242  1880  0.3%
P00025442  1615  0.3%
P00110742  1612  0.3%
P00112142  1562  0.3%
P00057642  1470  0.3%
P00184942  1440  0.3%
P00046742  1438  0.3%
P00058042  1422  0.3%
P00145042  1406  0.3%
P00059442  1406  0.3%
Other values (3621)  534817  97.2%

Length

Max length: 9
Mean length: 8.982729408
Min length: 8

Value  Count  Frequency (%)
Decimal_Number  10  90.9%
Uppercase_Letter  1  9.1%

Value  Count  Frequency (%)
Common  10  90.9%
Latin  1  9.1%

Value  Count  Frequency (%)
ASCII  11  100.0%

Gender
Categorical

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
M  414259  75.3%
F  135809  24.7%

Length

Max length: 1
Mean length: 1
Min length: 1

Value  Count  Frequency (%)
Uppercase_Letter  2  100.0%

Value  Count  Frequency (%)
Latin  2  100.0%

Value  Count  Frequency (%)
ASCII  2  100.0%

Age
Categorical

Distinct count: 7
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
26-35  219587  39.9%
36-45  110013  20.0%
18-25  99660  18.1%
46-50  45701  8.3%
51-55  38501  7.0%
55+  21504  3.9%
0-17  15102  2.7%

Length

Max length: 5
Mean length: 4.894358516
Min length: 3

Value  Count  Frequency (%)
Decimal_Number  9  81.8%
Dash_Punctuation  1  9.1%
Math_Symbol  1  9.1%

Value  Count  Frequency (%)
Common  11  100.0%

Value  Count  Frequency (%)
ASCII  11  100.0%
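
The first character table above appears to count the distinct characters of the variable by Unicode general category. The sketch below reproduces those counts for Age under that assumption, using the seven age labels listed in the common values table. The helper name and the category-name mapping are illustrative, not part of pandas-profiling.

```python
import unicodedata
from collections import Counter

# Long names for the two-letter general-category codes that occur here.
CATEGORY_NAMES = {
    "Nd": "Decimal_Number",
    "Lu": "Uppercase_Letter",
    "Pd": "Dash_Punctuation",
    "Sm": "Math_Symbol",
}


def distinct_char_categories(values):
    """Count the distinct characters in `values` per Unicode general category."""
    distinct_chars = set("".join(values))
    return Counter(CATEGORY_NAMES[unicodedata.category(c)] for c in distinct_chars)


# The seven Age labels from the common values table.
age_labels = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
counts = distinct_char_categories(age_labels)

total = sum(counts.values())
for name, count in counts.items():
    print(name, count, f"{100 * count / total:.1f}%")
```

With these labels the digit 9 never occurs, which is why only nine decimal digits are counted.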
 

Occupation
Real number (ℝ≥0)

ZEROS
Distinct count: 21
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 8.07670688
Minimum: 0
Maximum: 20
Zeros: 69638
Zeros (%): 12.7%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 2
Median: 7
Q3: 14
95th percentile: 20
Maximum: 20
Range: 20
Interquartile range (IQR): 12

Descriptive statistics

Standard deviation: 6.522660487
Coefficient of variation (CV): 0.8075891059
Kurtosis: -1.216113649
Mean: 8.07670688
Mean absolute deviation (MAD): 5.772156967
Skewness: 0.4001401099
Sum: 4442738
Variance: 42.54509983
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0. 1.5 2.5 3.5 4.5 ... 16.5 17.5 18.5 19.5 20. ], "bayesian blocks" binning strategy used)

Common values
Value  Count  Frequency (%)
4  72308  13.1%
0  69638  12.7%
7  59133  10.8%
1  47426  8.6%
17  40043  7.3%
20  33562  6.1%
12  31179  5.7%
14  27309  5.0%
2  26588  4.8%
16  25371  4.6%
Other values (11)  117511  21.4%

Minimum 5 values
Value  Count  Frequency (%)
0  69638  12.7%
1  47426  8.6%
2  26588  4.8%
3  17650  3.2%
4  72308  13.1%

Maximum 5 values
Value  Count  Frequency (%)
20  33562  6.1%
19  8461  1.5%
18  6622  1.2%
17  40043  7.3%
16  25371  4.6%

City_Category
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
B  231173  42.0%
C  171175  31.1%
A  147720  26.9%

Length

Max length: 1
Mean length: 1
Min length: 1

Value  Count  Frequency (%)
Uppercase_Letter  3  100.0%

Value  Count  Frequency (%)
Latin  3  100.0%

Value  Count  Frequency (%)
ASCII  3  100.0%

Stay_In_Current_City_Years
Categorical

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
1  193821  35.2%
2  101838  18.5%
3  95285  17.3%
4+  84726  15.4%
0  74398  13.5%

Length

Max length: 2
Mean length: 1.154028229
Min length: 1

Value  Count  Frequency (%)
Decimal_Number  5  83.3%
Math_Symbol  1  16.7%

Value  Count  Frequency (%)
Common  6  100.0%

Value  Count  Frequency (%)
ASCII  6  100.0%

Marital_Status
Boolean

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 4.2 MiB

Common values
Value  Count  Frequency (%)
0  324731  59.0%
1  225337  41.0%

Product_Category_1
Real number (ℝ≥0)

Distinct count: 20
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.404270018
Minimum: 1
Maximum: 20
Zeros: 0
Zeros (%): 0.0%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 1
5th percentile: 1
Q1: 1
Median: 5
Q3: 8
95th percentile: 13
Maximum: 20
Range: 19
Interquartile range (IQR): 7

Descriptive statistics

Standard deviation: 3.936211369
Coefficient of variation (CV): 0.7283520913
Kurtosis: 1.234756972
Mean: 5.404270018
Mean absolute deviation (MAD): 3.001889578
Skewness: 1.025734934
Sum: 2972716
Variance: 15.49375994
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 2.5 3.5 4.5 5.5 ... 16.5 17.5 18.5 19.5 20. ], "bayesian blocks" binning strategy used)

Common values
Value  Count  Frequency (%)
5  150933  27.4%
1  140378  25.5%
8  113925  20.7%
11  24287  4.4%
2  23864  4.3%
6  20466  3.7%
3  20213  3.7%
4  11753  2.1%
16  9828  1.8%
15  6290  1.1%
Other values (10)  28131  5.1%

Minimum 5 values
Value  Count  Frequency (%)
1  140378  25.5%
2  23864  4.3%
3  20213  3.7%
4  11753  2.1%
5  150933  27.4%

Maximum 5 values
Value  Count  Frequency (%)
20  2550  0.5%
19  1603  0.3%
18  3125  0.6%
17  578  0.1%
16  9828  1.8%

Product_Category_2
Real number (ℝ≥0)

MISSING
Distinct count: 17
Unique (%): < 0.1%
Missing: 173638
Missing (%): 31.6%
Infinite: 0
Infinite (%): 0.0%
Mean: 9.842329251
Minimum: 2
Maximum: 18
Zeros: 0
Zeros (%): 0.0%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 2
5th percentile: 2
Q1: 5
Median: 9
Q3: 15
95th percentile: 16
Maximum: 18
Range: 16
Interquartile range (IQR): 10

Descriptive statistics

Standard deviation: 5.086589649
Coefficient of variation (CV): 0.5168075075
Kurtosis: -1.432266899
Mean: 9.842329251
Mean absolute deviation (MAD): 4.625958938
Skewness: -0.1627577144
Sum: 3704948
Variance: 25.87339425
Histogram with fixed size bins (bins=10)

Common values
Value  Count  Frequency (%)
8  64088  11.7%
14  55108  10.0%
2  49217  8.9%
16  43255  7.9%
15  37855  6.9%
5  26235  4.8%
4  25677  4.7%
6  16466  3.0%
11  14134  2.6%
17  13320  2.4%
Other values (7)  31075  5.6%
(Missing)  173638  31.6%

Minimum 5 values
Value  Count  Frequency (%)
2  49217  8.9%
3  2884  0.5%
4  25677  4.7%
5  26235  4.8%
6  16466  3.0%

Maximum 5 values
Value  Count  Frequency (%)
18  2770  0.5%
17  13320  2.4%
16  43255  7.9%
15  37855  6.9%
14  55108  10.0%

Product_Category_3
Real number (ℝ≥0)

MISSING
Distinct count: 15
Unique (%): < 0.1%
Missing: 383247
Missing (%): 69.7%
Infinite: 0
Infinite (%): 0.0%
Mean: 12.66824321
Minimum: 3
Maximum: 18
Zeros: 0
Zeros (%): 0.0%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 3
5th percentile: 5
Q1: 9
Median: 14
Q3: 16
95th percentile: 17
Maximum: 18
Range: 15
Interquartile range (IQR): 7

Descriptive statistics

Standard deviation: 4.125337632
Coefficient of variation (CV): 0.325644019
Kurtosis: -0.8080661151
Mean: 12.66824321
Mean absolute deviation (MAD): 3.565943094
Skewness: -0.7654458894
Sum: 2113329
Variance: 17.01841057
Histogram with fixed size bins (bins=10)

Common values
Value  Count  Frequency (%)
16  32636  5.9%
15  28013  5.1%
14  18428  3.4%
17  16702  3.0%
5  16658  3.0%
8  12562  2.3%
9  11579  2.1%
12  9246  1.7%
13  5459  1.0%
6  4890  0.9%
Other values (5)  10648  1.9%
(Missing)  383247  69.7%

Minimum 5 values
Value  Count  Frequency (%)
3  613  0.1%
4  1875  0.3%
5  16658  3.0%
6  4890  0.9%
8  12562  2.3%

Maximum 5 values
Value  Count  Frequency (%)
18  4629  0.8%
17  16702  3.0%
16  32636  5.9%
15  28013  5.1%
14  18428  3.4%

Purchase
Real number (ℝ≥0)

Distinct count: 18105
Unique (%): 3.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 9263.968713
Minimum: 12
Maximum: 23961
Zeros: 0
Zeros (%): 0.0%
Memory size: 4.2 MiB

Quantile statistics

Minimum: 12
5th percentile: 1984
Q1: 5823
Median: 8047
Q3: 12054
95th percentile: 19336
Maximum: 23961
Range: 23949
Interquartile range (IQR): 6231

Descriptive statistics

Standard deviation: 5023.065394
Coefficient of variation (CV): 0.5422152805
Kurtosis: -0.3383775656
Mean: 9263.968713
Mean absolute deviation (MAD): 4069.959166
Skewness: 0.6001400037
Sum: 5095812742
Variance: 25231185.95
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1.20000e+01 1.25000e+01 1.35000e+01 2.45000e+01 2.55000e+01 ... 2.10805e+04 2.15685e+04 2.26535e+04 2.30405e+04 2.39610e+04], "bayesian blocks" binning strategy used)

Common values
Value  Count  Frequency (%)
7011  191  < 0.1%
7193  188  < 0.1%
6855  187  < 0.1%
6891  184  < 0.1%
6960  183  < 0.1%
7012  183  < 0.1%
6879  182  < 0.1%
7166  182  < 0.1%
7027  182  < 0.1%
7165  180  < 0.1%
Other values (18095)  548226  99.7%

Minimum 5 values
Value  Count  Frequency (%)
12  101  < 0.1%
13  106  < 0.1%
14  95  < 0.1%
24  118  < 0.1%
25  113  < 0.1%

Maximum 5 values
Value  Count  Frequency (%)
23961  3  < 0.1%
23960  4  < 0.1%
23959  2  < 0.1%
23958  4  < 0.1%
23956  1  < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line (its angle to the x-axis) does not affect r; only the sign of the slope does.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
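
A minimal sketch of this calculation in plain Python, using population (divide-by-n) covariance and standard deviations; the normalisation cancels in the ratio, so sample versions give the same r. The function name and the toy data are illustrative and are not taken from the profiled dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)


# Hypothetical, nearly linear toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson_r(x, y)
# Shifting and rescaling each variable separately leaves r unchanged.
r_transformed = pearson_r([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])
print(f"r = {r:.4f}, after location/scale change = {r_transformed:.4f}")
```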

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic relationships than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
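
A sketch of the rank-based calculation in plain Python. It assumes tied values receive the average of the ranks they span, which is a common convention but is not stated in the text above. All names and the toy data are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Pearson's r applied to the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical monotonic but nonlinear toy data.
x = [1, 2, 3, 4, 5]
y = [a ** 3 for a in x]
print(f"Pearson r = {pearson_r(x, y):.4f}, Spearman rho = {spearman_rho(x, y):.4f}")
```

On this toy data ρ reaches 1 while r stays below 1, which illustrates the difference between monotonic and linear correlation.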

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the difference between the number of concordant pairs and the number of discordant pairs, divided by the total number of pairs.
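
The pair-counting definition can be sketched directly in plain Python. This is the form often called tau-a, in which tied pairs count as neither concordant nor discordant; statistical libraries frequently default to a tie-adjusted variant (tau-b) instead, so their results can differ when ties are present. The function name and the toy data are illustrative.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        n_pairs += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
        # product == 0 is a tie and counts as neither
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
tau = kendall_tau_a(x, y)
print(f"tau = {tau:.4f}")
```

The double loop over all pairs is quadratic in the number of observations, so this form is only practical for small samples.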

Missing values

Sample

First rows

Index   User_ID  Product_ID  Gender  Age    Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
0       1000001  P00069042   F       0-17   10          A              2                           0               3                   NaN                 NaN                 8370
1       1000001  P00248942   F       0-17   10          A              2                           0               1                   6.0                 14.0                15200
2       1000001  P00087842   F       0-17   10          A              2                           0               12                  NaN                 NaN                 1422
3       1000001  P00085442   F       0-17   10          A              2                           0               12                  14.0                NaN                 1057
4       1000002  P00285442   M       55+    16          C              4+                          0               8                   NaN                 NaN                 7969
5       1000003  P00193542   M       26-35  15          A              3                           0               1                   2.0                 NaN                 15227
6       1000004  P00184942   M       46-50  7           B              2                           1               1                   8.0                 17.0                19215
7       1000004  P00346142   M       46-50  7           B              2                           1               1                   15.0                NaN                 15854
8       1000004  P0097242    M       46-50  7           B              2                           1               1                   16.0                NaN                 15686
9       1000005  P00274942   M       26-35  20          A              1                           1               8                   NaN                 NaN                 7871

Last rows

Index   User_ID  Product_ID  Gender  Age    Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
550058  1006024  P00372445   M       26-35  12          A              0                           1               20                  NaN                 NaN                 121
550059  1006025  P00370853   F       26-35  1           B              1                           0               19                  NaN                 NaN                 48
550060  1006026  P00371644   M       36-45  6           C              1                           1               20                  NaN                 NaN                 494
550061  1006029  P00372445   F       26-35  1           C              1                           1               20                  NaN                 NaN                 599
550062  1006032  P00372445   M       46-50  7           A              3                           0               20                  NaN                 NaN                 473
550063  1006033  P00372445   M       51-55  13          B              1                           1               20                  NaN                 NaN                 368
550064  1006035  P00375436   F       26-35  1           C              3                           0               20                  NaN                 NaN                 371
550065  1006036  P00375436   F       26-35  15          B              4+                          1               20                  NaN                 NaN                 137
550066  1006038  P00375436   F       55+    1           C              2                           0               20                  NaN                 NaN                 365
550067  1006039  P00371644   F       46-50  0           B              4+                          1               20                  NaN                 NaN                 490